Classification and prediction for multi-cancer data with ultrahigh-dimensional gene expressions

Analysis of gene expression data is an attractive topic in the field of bioinformatics, and a typical application is to classify and predict individuals’ diseases or tumors by treating gene expression values as predictors. A primary challenge of this study comes from ultrahigh-dimensionality, which makes that (i) many predictors in the dataset might be non-informative, (ii) pairwise dependence structures possibly exist among high-dimensional predictors, yielding the network structure. While many supervised learning methods have been developed, it is expected that the prediction performance would be affected if impacts of ultrahigh-dimensionality were not carefully addressed. In this paper, we propose a new statistical learning algorithm to deal with multi-classification subject to ultrahigh-dimensional gene expressions. In the proposed algorithm, we employ the model-free feature screening method to retain informative gene expression values from ultrahigh-dimensional data, and then construct predictive models with network structures of selected gene expression accommodated. Different from existing supervised learning methods that build predictive models based on entire dataset, our approach is able to identify informative predictors and dependence structures for gene expression. Throughout analysis of a real dataset, we find that the proposed algorithm gives precise classification as well as accurate prediction, and outperforms some commonly used supervised learning methods.


Introduction
Analysis of gene expression data is an important topic in bioinformatics. A large body of research and relevant developments have been explored in recent years. One of important branches of gene expression data analysis is to take gene expression values as predictors to classify and predict tumors to possible cancers. A motivated example in this paper is the GCM dataset, which contains 16,063 gene expression values and 14 human cancers among 198 tumor samples. The goal of this study is to take gene expression values as the predictors, and use them to classify tumor samples to their corresponding cancers. In this dataset, a key feature is ultrahigh-dimensional predictors in the sense that the dimension of predictors (number of gene expression values) is extremely greater than the sample size ( further induces some challenges, including (a) pairwise interactions among gene expressions and (b) existence of non-informative gene expressions, that affect the performance of classification and the accuracy of prediction.
To address classification and prediction for biomedical research, many supervised learning methods have been developed and have been widely applied in machine learning frameworks. With the ignorance of pairwise interactions and existence of non-informative predictors induced by ultrahigh-dimensional predictors, [1] proposed the integration of several heterogeneous cancer series, and performed a multi-class classification. [2] studied multicategory support vector machine (SVM) for the classification of multiple cancer. [3] presented comprehensive discussions of SVM methods. [4] applied SVM ensembers to analyze breast cancer prediction. [5] discussed linear discrimination analysis (LDA) and its application in the microarray. [6] discussed the multi-class analysis by generalized sparse linear discriminant analysis. The detailed and fundamental discussions of those methods can be found in [7,8], and were reviewed by [9] as well. In recent years, deep learning approaches, such as convolutional neural network (e.g., [10]) or natural language processing (e.g., [11]), have been developed to deal with multicalssification. More applications can be found in some monographs, such as [12][13][14].
To characterize pairwise interactions among gene expressions, which usually refers to the network dependence among gene expressions, we employ graphical models that are powerful methods in describing the dependence structure of variables. A general introduction of graphical models can be found in [7] (Chapter 17). In the past literature, graphical models have been used to deal with the classification problem. For example, [15] proposed the network-based support vector machine for the classification of microarray samples for binary classification. [16] discussed the identification of rheumatoid arthritis-related genes by using a networkbased support vector machine. [17] proposed network linear discriminant analysis. [18] proposed the nearest neighbor network. Most existing methods focused on binary responses and restricted the predictors to follow the normal distribution because of explorations of the precision matrix. Furthermore, it is intuitive to understand that the network structure of variables in different classes may not be exactly equal to each other. To address this issue, [19,20] explored SVM and logistic regressions with heterogeneous network structures accommodated, respectively. More recently, [21,22] developed multiclass discriminant analysis with network structures accommodated. From the perspectives of Bayesian approaches, several methods were also investigated with the network structure incorporated, including [23,24].
To address non-informative gene expression values in ultrahigh-dimensional data, variable selection or dimension reduction are perhaps commonly used strategies in the past literature. For example, [25] applied unsupervised feature extraction, such as principal component analysis, tensor decomposition, and kernel tensor decomposition, to select potentially important genes. [26] adopted SIS method to do feature screening for gene expressions and combined Nottingham Prognostic Index with a hybrid signature accommodated. With the combination of supervised learning, [27] proposed the penalized method for SVM. [28,29] explored variable selection based on LDA. Those methods mainly handled the setting that the dimension is smaller than the sample size, however, it is unknown whether those methods are able to deal with the case that the dimension of predictors is much higher than the sample size.
From the two challenges and developments described above, we note that most existing methods deal with either network structure or variable selection but not both. It motivates us to propose a strategy to simultaneously retain important predictors and construct the network structure of predictors when doing classification. Our strategy is outlined in Fig 1. Roughly speaking, (i) to deal with ultrahigh-dimensional predictors where the dimension of predictors is extremely greater than the sample size, we adopt feature screening techniques to retain predictors that are informative to the response; (ii) to detect network structures of predictors, we employ exponential family graphical models to detect network structure of the selected predictors under the whole dataset or different classes; (iii) use the results in (i) and (ii) to develop network-based classification models to examine class separation and make the prediction for tumor samples.
There are several contributions in the proposed method. First, unlike existing methods that may specify a model when doing feature screening, our feature screening procedure is modelfree and does not need to specify the model formulation. Second, although there exist methods handling network structures in classification, they assume a common network structure for predictors of all subjects without taking into account of possible heterogeneity for different classes. Instead, the proposed method is able to construct predictive models with possibly class-dependent network structures of predictors taken into account. Finally, the proposed method is able to handle multi-class labels with the accommodation of network structures in predictors, which is different from existing methods that either handle multiclassification but not use the information of network structure, or simply accommodate network structure to deal with binary classification.
The remainder is organized as following. In Section 2, we introduce a motivated real dataset and its data structure. In addition, we define the relevant mathematical notation. In Section 3, we give detailed presentation for each step in Fig 1. In Section 4, we implement the proposed method to analyze a real dataset and compare the proposed method with its competitors. A general discussion is presented in Section 5.

Data structure with multi-class responses
In this section, we first introduce a motivated dataset outlined in Section 1. After that, we define mathematical notation to describe the data structure with multi-class responses.

Description of motivated dataset
The data presented in the following are the GCM dataset collected by [30]. This dataset contains 16,063 gene expression values and 198 tumor samples, including 144 training samples (denoted as T ) and 54 testing samples (denoted as V). In addition, 14 common human cancers, including Breast (BR), Prostate (PR), Lung (LU), Colorectal (CO), Lymphoma (LY), Bladder (BL), Melanoma (ML), Uterus (UT), Leukemia (LE), Renal (RE), Pancreas (PA), Ovarym (OV), Mesothelioma (ME) and CNS cancers, are included in the dataset. The sample sizes of each cancer are summarized in Table 1. Our main goal is to classify tumor samples into different categories of cancer according to gene expression values of the samples, which are treated as predictors.
Even though this dataset is no need to pre-processing due to complete observations without missing value, and some of its features having been well analyzed by [30], still, the dataset can be further investigated in two aspects. First of all, we propose to note the issue of high-dimentionality of the data, which usually implies the existence of irrelevant variables, i.e., not every gene expression is dependent upon the response. Therefore, to ensure the accuracy of prediction, it is necessary to exclude irrelevant variables. As a result, it is crucial to select gene expressions that are informative in terms of responses. Secondly, as discussed in [31,32], complex dependence structures may exist among high-dimensional gene expressions. Therefore, to increase the accuracy of predictions, it is necessary to incorporate the network structure of gene expressions into the classification procedure.

Notation
In this subsection, we define mathematical notation to describe the data in order to develop the method.
Suppose the data of n subjects come from I classes, where I is a fixed integer greater than 2 and the classes are nominal. Let n i be the class size in class i with i = 1, � � �, I, and hence n ¼ P I i¼1 n i . Let Y denote the n-dimensional vector of response with the jth component being Y j = i, which reflects the class membership that the jth subject is in the ith class for i = 1, � � �, I and j = 1, � � �, n. Let p > 1 denote the dimension of predictors for each subject. Define X = [X j, l ] as the n × p matrix of predictors for j = 1, � � �, n and l = 1, � � �, p, where the component X j,l represents the lth predictor for the jth subject. Furthermore, let X j• = (X j,1 , � � �, X j,p ) > denote the p-dimensional predictor vector for the jth subject in the jth row of X and let X •k = (X 1,k , � � �, X n,k ) > represent the n-dimensional vector of the kth predictor in the kth column of X. In this paper, we Table 1. Sample sizes for each cancer. The first row with T contains sample sizes of the training data in cancer labels; the second row with V contains sample sizes of the testing data in cancer labels; the last row with "Total" contains sample sizes of the whole data in cancer labels. consider a setting that the dimension of the predictors p is ultrahigher than the sample size n, i.e., p = exp{O(n r )} for some constant r > 0 (e.g., [33]). Without loss of generality, the {X j• , Y j } are treated as independent and identically distributed (i.i.d.) for j = 1, � � �, n. We let lower case letters represent realized values for the corresponding random variables.

BR
The objective of the study is to build models to predict the class label for a new subject with observationX.

Proposed method
In this section, we present detailed estimation procedure for each step as shown in Fig 1.

Feature screening via rank-based correlation coefficient
Let denote the true active set which contains all relevant predictors for the response Y with q ¼ jIj and q < n, and I c is the complement of I that contains all irrelevant predictors for the response Y. Basically, the goal of Step 1 in Fig 1 is to estimate the active set I. When I is determined, then the associated vector of predictors X I ¼ fX �k : k 2 Ig contains important information in terms of the response, and its dimension is smaller than the sample size n. Thus, X I can be adopted to the subsequent analysis.
The remaining concern is to obtain the estimated active set. Following the spirit of [33], we employ the technique of feature screening, whose idea is to take the correlation of the response and the predictors as a signal, and retain the important predictors with large values of signals. We propose to take the rank-based correlation coefficient as the signal. Specifically, for the kth predictor X •k , the rank-based correlation coefficient between X •k and Y is given by (e.g., [34,35]) where Ið�Þ denotes the indicator function and μ(�) is the law of Y. It can be shown that ω k is in an interval [0, 1], and a higher value of ω k indicates a stronger correlation between Y and X •k . Therefore, (1) can be regarded as similar to the classical coefficients such as Pearson's correlation.
To implement this idea, we estimate (1) using the sample data. For j = 1, � � �, n, denote Y (j) as the rearranged response according to the sort of the kth predictors X •k , i.e., ( The corresponding estimator of ω k is given by [34]: where, for j = 1, � � �, n, ' j ≜# fl : Y ðlÞ � Y ðjÞ g, r j ≜# fl : Y ðlÞ � Y ðjÞ g, and # A represents the number of elements in a set A. In applications, one can use the R package XICOR to compute (2). Therefore, the estimated active set based on (2) is given bŷ where c and κ 2 (0, 1/2) are prespecified threshold values. In applications, one can specify c and κ such that variables with the first n log n h i largest values ofô k can be retained, where [�] represents the ceiling function (e.g., [33,35,36]). Different from the conventional feature screening method (e.g., [33]), the main advantage of (3) is model-free feature screening because it does not impose model formulation, and thus, (3) is able to detect predictors that may have nonlinear relationship with the response Y. Theoretically, by the similar derivations of [35], the sure screening property of (3) can be justified. That is, PðI �Î Þ ! 1 as n ! 1, which ensures that the estimated active set contains truly informative predictors that are dependent on the response with a probability approaching one. Moreover, while there are several methods to deal with feature screening, as examined by [35], (2) generally outperforms other existing approaches and is able to handle oscillatory trajectory between the response and predictors.
When the active set is determined, we then let X j;Î ¼ fX j;k : k 2Î g denote the vector containing all the active predictors for the jth subject, and denote x j;Î as the realization values of X j;Î .

The expressions of graphical structure
Since the estimated active setÎ is identified, we now explore the network structure of selected gene expressions inÎ for Step 2 in Fig 1. Graphical models are commonly used strategies to achieve this goal.
The graph is expressed as G = (V, E), where V is the set of the vertices and E � V × V is the set of the edges. In our case, V≜Î is treated as selected predictors withq ¼ jVj and E is regarded as pairwise dependence of any two selected predictors. In graphical model frameworks, we start by formulating the distribution function of selected predictors. In this article, we consider exponential family graphical models because it generalizes the commonly used models. The formulation is given by and C(�) are given functions that reflect the distribution of XÎ (e.g., [20,37]), and the function A(β, Θ) is normalizing constant which ensures (4) to be integrated as 1.
Without loss of general interest, we take B(X j,r ) as the linear function B(X j,r ) = X j,r for r 2 V. In addition, in the graphical model theory, the main interest is the estimation of θ st because of its interpretation that X j,s and X j,t are conditionally dependent if θ st 6 ¼ 0. Therefore, to focus on presenting the estimation of θ st , we drop the main effect term, and consider the following graphical model where the function A(Θ) is normalization constant which makes (5) be integrated as 1.
For the estimation method for Θ, one of the famous methods is the conditional inference [38]. Without loss of generality, we consider the vertex s, and define the neighbourhood set which collect vertexes that are dependent on the vertex s. To estimate the neighbourhood set of s, it suffices to study the inference of X j,s |X j,V\{s} , where X j;Vnfsg ¼ ðX j;1 ; � � � ; X j;sÀ 1 ; X j;sþ1 ; � � � ; X j;q Þ. Let y s ¼ ðy s1 ; � � � ; y sðsÀ 1Þ ; y sðsþ1Þ ; � � � ; y sq Þ denote the ðq À 1Þ-dimensional vector of parameters that is associated with X j,V\{s} . By some algebra, we have where D(�) is a normalization constant ensuring that the integration of (7) is equal to 1. Then the estimator of θ s , denoted asŷ s , is given bŷ where ; k�k 1 is the L 1 -norm and λ is the tuning parameter.
In the penalization problem for selecting the variables, estimating the tuning parameter is also a crucial issue. In this paper, we employ the BIC approach (e.g., [39]) to select the tuning parameter λ. To emphasize the dependence on the tuning parameter, we letŷ s ðlÞ denote the estimator obtained from (8). Define where dffŷ s ðlÞg represents the number of non-zero elements inŷ s ðlÞ for a given λ. The optimal tuning parameter λ, denoted byl, is determined by minimizing (9) within suitable ranges of λ. As a result, the estimator of θ s is determined byŷ s ¼ŷ s ðlÞ. Finally, the estimated neighbourhood set is given bŷ

Multinomial logistic regression with homogeneous network structure in predictors
After obtaining the estimated network structure based on informative predictors, we wish to use such a network structure to examine the classification for different cancers, as demonstrated in Step 3 of Fig 1. Therefore, to incorporate the network structures of the predictors into a prediction model, we present two methods which can be readily implemented using the R package glm for fitting a logistic regression model. In the first method, called the multinomial logistic regression with homogeneous network structure in predictors (MLR-HomoNet), we consider the case where the subjects in different classes share a common network structure in the predictors. To build a prediction model, we make use of the development of the logistic model with multiclass responses ([41], Section 6.1; [42], Section 7.1).
We first identify the pairwise dependence of the predictors using the measurements of all the subjects without distinguishing their class label. Letŷ st be the estimate for θ st obtained for (8) by using all the predictor measurements of fX j;Î : j ¼ 1; � � � ; ng, and letÊ ¼ fðs; tÞ :ŷ st 6 ¼ 0g denote the resulting estimated set of edges. Next, for i = 1, � � �, I and j = 1, � � �, n, we let The estimator of α, denotedâ, can be derived by maximizing (14). In applications, sinceâ has no closed form, we usually implement the Newton-Raphson algorithm to (14) and obtain the resulting estimator. Therefore, for the realization x j;Î of the q-dimensional vector X j;Î , p i ðx j;Î Þ is estimated aŝ for i = 1, � � �, I − 1, and p I ðx j;Î Þ is estimated aŝ Finally, to predict the class label for a new subject with a selectedq-dimensional predictor instancex, we first calculate the right-hand side of (15) and (16) To the end, we summarize key steps in Sections 3.1-3.3 in Algorithm 1.

Algorithm 1: MLR-HomoNet
Under the training data T ; Step 1: Determine informative predictors Apply (2) to do feature screening and retain n log n h i predictors among p-dimensional predictors. A set of selected predictors is given by (3).
Step 2: Determine the network structure of predictors Based on selected predictors inÎ , use (8) to determine pairwise dependence structure and obtain (11). The resulting network structure is formed byÊ.

Logistic regression with heterogeneous network structured in predictors
We now present an alternative method to that described in Section 3.3. Instead of pooling all the predictors to feature the predictor network structure, this method, called the logistic regression with heterogeneous network structured in predictors (LR-HeteNet), stratifies the predictor information by class when characterizing the predictor network structures. The implementation is summarized in Algorithm 2.
Step 1: Class-dependent active set Apply (18) to do feature screening and retain n i log n i h i predictors among p-dimensional predictors. A set of selected predictors for class i is given by (19).
Step 2: Class-dependent predictor network Based on selected predictors inĴ i , use (8) to determine pairwise dependence structure and obtain (11). DenoteÊ i as the resulting network structure.
Step 3: Class-dependent predictive model Given a initial value g ð0Þ i , then perform the Newton-Raphson algorithm; for step t with t = 1, 2, � � �, T, say T = 1000 do Step 3.1: calculate the score function evaluated at the tth iterated value: where p i ðx j;Ĵ i ; g ðtÞ i Þ is (20) with parameters replaced by g ðtÞ i ; Step 3.2: calculate the Henssian matrix evaluated at the tth iterated value: Step 3.3: update g Be more specific, under the training data T , we first introduce a binary, surrogate response variable for every i = 1, � � �, I and j = 1, � � �, n. Let After that, we adopt the similar strategy in Algorithm 1 to construct predictive models for class i. Specifically, in Step 1 of Algorithm 2, let J i ¼ fk : X �k is dependent on Y i g denote the true active set of the class i which contains all relevant predictors for the response Y i with jJ i j < n i . Following (2), the signal of X •k and Y i is defined as o i k ≜xðX �k ; Y i Þ, and it can be estimated byô where, for j = 1, � � �, n, ' i j ≜# fl : Y i ðlÞ � Y i ðjÞ g and r i j ≜# fl : Y i ðlÞ � Y i ðjÞ g with Y i ðjÞ being the rearranged response according to the sort of the kth predictors X •k . Therefore, J i can be estimated asĴ where c i and κ i 2 (0, 1/2) are some prespecified threshold values. Let X j;Ĵ i ¼ fX j;k : k 2Ĵ i g denote the vector of all the active predictors that depends on Y i for the jth subject. Moreover, since Y i is defined as the response with binary outcomes, similar derivations in [35] show that (18) is valid to measure the dependence between categorical and continuous variables, and the point-biserial correlation coefficient is a special case of (18). In Step 2 of Algorithm 2, let V i ≜Ĵ i denote the vertex set containing predictors that are dependent on the class i = 1, � � �, I. We apply the procedure described in Section 3.2 to determine the network structure of predictors in the class i. LetÊ i ¼ fðs; tÞ :ŷ i st 6 ¼ 0g denote an estimated set of edges for the class i, whereŷ i st is the estimate of θ st derived from (8) based on using the predictor measurements in the class i.
After that, Step 3 in algorithm 2 aims to fit a logistic regression model using the surrogate response vector Y i with the estimated predictors network structureÊ i incorporated for i = 1, Þ and consider the parametric logistic regression model where j = 1, � � �, n, g i ≜ðg i0 ; g > i� Þ with g i� ¼ ðg i;st : ðs; tÞ 2Ê i Þ > is the vector of parameters associated with class i. In the spirit of the maximum likelihood estimation (MLE) method (e.g., [42]), the log-likelihood function of (20) is given by and the estimator of γ i , denotedĝ i ≜ðĝ i0 ;ĝ > i� Þ, is obtained by maximizing (21). In applications, we implement the Newton-Raphson algorithm to obtainĝ i ; the detailed procedure is summarized in Algorithm 2. Consequently, for the realization x j;Ĵ i of the jĴ i j-dimensional vector X j;Ĵ i , based on (20), p i ðx j;Ĵ i Þ can be estimated bŷ for i = 1, � � �, I.
Finally, when predictive models based on the training data T are obtained, we now examine the prediction for the testing data V in Step 4 of Algorithm 2. Letx j;Ĵ i denote a jĴ i j-dimensional predictor vector for a new subject. We calculate (22) with x j;Ĵ i replaced byx j;Ĵ i for i = 1, � � �, I, and letp 1 ; � � � ;p I denote the corresponding values. Let i � denote the index which corresponds to the largest value of fp 1 ; � � � ;p I g, i.e., Then the class label for this new subject is predicted as i � .

Results
In this section, we aim to implement Algorithms 1 and 2 in Section 3 to the GCM dataset introduced in Section 2.1.

Detection of informative gene expressions via feature screening
In the GCM dataset, there are I = 14 classes. The dimension of predictors is p = 16, 063 and the sample size is n = 198, where the size of the training set is 144 and the size of the testing set is 54. Following steps in Fig 1, we first implement the proposed method in Section 3 to fit models based on the training set, and then assess the performance of prediction by examining the testing set.
Since the dimension of predictors is extremely larger than the sample size, i.e., p � n, to determine the informative predictors, we adopt the screening signal (2) to retain informative gene expressions. The first strategy in Algorithm 1 is to apply (2) to evaluate the signal of X •k and Y 2 {1, � � �, 14} and determine the estimated active set (3); the second consideration in Algorithm 2 is to calculate the signal of X •k and Y i for i = 1, � � �, 14 and then obtain the estimated class-dependent active set (19)

Network-based classification models
After the feature screening step, we next apply the estimation procedure in Section 3.2 to determine the network structure of selected gene expressions in the training set. To adopt the determined network structures to examine the classification, we implement the network structures and the training set to the classification models proposed in Sections 3.3 and 3.4, respectively. To see the fitness of two models, we first implement the training data to the fitted models and examine the classification. The 14×14 confusion matrices based on the MLR-HomoNet and LR-HeteNet methods are shown in Tables 2 and 3, respectively, where columns are labels from the training data T , rows are labels of fitted values, diagonal entries reflect number of correct classification, and nondiagonal entries are number of misclassification by fitted values. In general, both methods show satisfactory model fittness as the accuracy of classification is high. Moreover, we observe that the LR-HeteNet method seems to slightly outperform the MLR-HomoNet method since the latter method produces slightly larger misclassification on BR, PR, CO, and UT than those of the former method. This result makes sense because the LR-HeteNet method is based on class-dependent network structure that can directly reflect the corresponding cancers. For a clear visualization, we further display two heatmaps in Fig 4, which are obtained by Tables 2 and 3 with each row divided by the classdependent sample size in the training data. We observe that diagonal entries have dark color, which indicate that the proportion of true classification is high and Algorithms 1 and 2 give well-fitted models.

Prediction
When the predictive models are constructed, we now assess the performance of the proposed method by examining the prediction for the testing data. We implement the predictors in the testing data to the two proposed methods, and then make the prediction of classification. After that, we summarize the response in the testing data and the predictive classes to 14 × 14 confusion matrices in Tables 4 and 5, respectively, where columns are labels from the testing samples V, rows are labels of predicted values, diagonal entries reflect number of correct classification, and nondiagonal entries are number of misclassification by predicted values. Moreover, we also display two heatmaps in Fig 5 that are obtained by Tables 4 and 5 with each row divided by the class-dependent sample size in the testing data. From confusion matrices and heatmaps, We can see that two proposed methods have satisfactory performance in prediction because most of predicted classes are the same as class labels in the testing data, except for little misclassification.  BR  PR  LE  CO  LU  BL  CNS  UT  LY  RE  PA  OV  ME  ML   BR  6  1  0  0  0  1  1  0  0  1  1   To assess the performance of classification and prediction numerically, we evaluate some commonly used criteria: micro averaged metrics, macro averaged metrics, and the adjusted Rand index. For a subject j in the testing data with j = 1, � � �, 54, letŷ new;j denote the predicted class label determined by the prediction models and let y new,j denote the class label in the testing data. For class i = 1, � � �, I, we respectively calculate the number of the true positives (TP), the number of the false positives (FP), and the number of the false negatives (FN) as Iðy new;j 6 ¼ i;ŷ new;j ¼ iÞ; ð25Þ For micro averaged metrics, precision and recall are, respectively, defined in terms of (24), (25), and (26): and Then Micro-F-score is defined as On the other hand, for macro averaged metrics, for i = 1, � � �, I, let PRE i ¼ TP i TP i þFP i denote precision for class i, and let REC i ¼ TP i TP i þFN i denote recall for class i. Then the overall precision and recall are, respectively, given by and and Macro-F-score is defined as As mentioned in [43], ARI is bounded above by one, and higher value of ARI indicates accurate classification. We primarily adopt (27), (28), (29), (30), (31), (32), and (33) to assess the performance of two proposed methods. In addition, to compare with the proposed methods, we also examine several well established supervised learning methods, including logistic regression models without incorporating network structure [42], the support vector machine (SVM) that was examined by [30], K-nearest neighbor (KNN), linear discriminant analysis (LDA), Bayes, artificial neural network (ANN), XGBoost, random forest (RF), bagging, and long short-term memory (LSTM) methods. The implementation of corresponding R packages is summarized in Table 6.
The prediction results of the proposed and competitive methods are summarized in Table 7. In general, we can observe that the two proposed methods have the largest values of PRE, REC, F-score, and ARI than other existing methods. For the comparisons among existing methods, we can see that advanced machine learning or deep learning methods (e.g., ANN, RF, Bagging) outperform the conventional ones, such as LDA or SVM, but are less satisfactory than the proposed methods because of slightly large misclassification. It verifies that incorporating network structures would improve the accuracy of classification and prediction. In addition, the other reason is that, unlike existing methods that possibly incur overfitting because of direct implementation of all gene expression values to fit models, the two proposed methods simply retain gene expression values and detect network structures that are related to the response, yielding parsimonious models. In this way, noises and impacts induced by irrelevant gene expression values can be eliminated. Compared with two proposed methods, we can see that the LR-HeteNet method outperforms the MLR-HomoNet method with larger values of criteria. The main reason is that the MLR-HomoNet model in Section 3.3 directly deals with multi-label classification by using a common network structure to classify tumors to the corresponding cancers. To simultaneously reflect information to all classes, the network structure displayed in Fig 2 is expected to require more gene expression values and complex interactions. On the other hand, the LR-HeteNet method in Section 3.4 identifies predictors and unique network structure to reflect a specific cancer, suggesting that types of cancers can be uniquely represented by different network structures of gene expression values. As shown in Fig 3, one can directly adopt a given network structure to classify tumors to their cancers with high accuracy of prediction. In summary, with noise induced by irrelevant predictors removed and informative network structures of predictors accommodated, the accuracy of classification and prediction has significant improvement.

Discussion
In this paper, we present the network-based classification method to predict the classification of the tumor samples, which is an ultrahigh dimensional system, i.e., with multitudinous gene expressions as predictors. In the proposed method, we first adopt model-free feature screening technique to retain informative gene expressions from ultrahigh-dimensional data. After that, we identify the network structures of the detected gene expressions based on different cancers, and the property of the network structure recovery allows us to fit the nominal logistic regression based on the network structure and examine the classification and prediction. Compared with other existing methods, the proposed method gives more precise prediction results. There are several possible extensions based on the current work. For example, the RNA sequences, regarded as count data, are also frequently explored in bioinformatics. The proposed method can be naturally extended to deal with the RNA sequence data by treating them as the predictors because the signal of detecting predictors (2) is free of distribution of random variables, and the identification of network structure in Section 3.2 is based on exponential family graphical models. For the implementation of classification models, it is interesting to explore other machine learning methods, such as SVM, LDA, or KNN, and other deep learning approaches that are popular in data science.
Moreover, the research gap still exists and more explorations can be done by extending the proposed method. For example, as discussed in [32], measurement error in predictors is ubiquitous in data analysis, especially that mismeasurement is inevitable in gene expression data (e. g, [52]). Ignoring measurement error effects is expected to increase the possibility of false classification and lead to wrong conclusion. Therefore, it is important to develop a new erroreliminating strategy to deal with measurement error based on the current method. Finally, as R packages associated with some of the existing methods have been developed, the new method proposed here anticipates a corresponding R package.